Data updating method using overlap area and program converting device for converting update program in distributed-memory parallel processor

ABSTRACT

In a parallel processor, a local area and an overlap area are assigned to the memory of each processing element (PE), and each PE makes calculations to update the data in both areas at the runtime. If the data in the overlap area is updated in processes closed in the PEs, the data transfer between adjacent PEs can be reduced and the parallel processes can be performed at a high speed.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a method of executing a program at a high speed through a distributed-memory parallel processor, and more specifically to a data updating method using an overlap area and a program converting device for converting a data update program.

2. Description of the Related Art

Recently, parallel processors have been drawing attention as a way of realizing a high-speed computer such as a supercomputer in the form of a plurality of processing elements (hereinafter referred to as PEs or processors) connected through a network. In realizing high processing performance with such a parallel processor, it is important to reduce the overheads of data communications to the lowest possible level. One effective method of reducing the overheads of data communications between processors is to use an overlap area for specific applications.

The time required for data communications depends on the number of packet communications rather than on the total volume of data. Therefore, integrating the communications and representing the messages as vectors (S, WP, K. Kennedy, and C. WP “Compiler WP for Fortran D on WP Distributed-Memory Machines,” in Proc. WP '91 pp. 86-100, Nov. 1991.) is important in reducing the communications overheads. An overlap area is a special type of buffer area for receiving vector data, and is assigned such that it surrounds the local data area (local area) used for internal computation. The data values in the overlap area are determined by the adjacent processors.

FIG. 1 shows the program code (Jacobi code) of the Jacobi relaxation written in High Performance Fortran (HPF). In the Jacobi code shown in FIG. 1, the value of each element a(i, j) of the two-dimensional array a is updated using the values of the four adjacent elements a(i, j+1), a(i, j−1), a(i+1, j), and a(i−1, j). The size of the array a is specified by a(256, 256). The elements where i=1, i=256, j=1, or j=256 are not updated. A notation such as a(2:255) in the DO loop that updates the data is a Fortran 90 array section, and the DO loop is repeated t times. This code is a typical example of a data update using an overlap area.
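Since FIG. 1 is not reproduced here, the update described above can be illustrated by the following minimal Python sketch. The averaging rule and the names jacobi_step and jacobi are assumptions made for illustration only; the text states only that each element is updated from its four adjacent elements and that the border elements are not updated.

```python
def jacobi_step(a):
    """Return a new grid in which each interior element is replaced by the
    mean of its four neighbours (the averaging rule is an assumption).
    Border rows and columns are copied unchanged."""
    n, m = len(a), len(a[0])
    new = [row[:] for row in a]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = (a[i][j + 1] + a[i][j - 1]
                         + a[i + 1][j] + a[i - 1][j]) / 4.0
    return new


def jacobi(a, t):
    """Apply t serial sweeps, mirroring the outer DO loop repeated t times."""
    for _ in range(t):
        a = jacobi_step(a)
    return a
```

Each sweep reads only values from the previous sweep, which is why the inner update can be performed in parallel while the outer loop remains serial.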

FIG. 2 shows an example of the overlap area used when the Jacobi code shown in FIG. 1 is executed. According to the data distribution specified in the program shown in FIG. 1, the elements of the array a(256, 256) are distributed into the local areas of 16 processors p(x, y) (x=1, 2, 3, 4, y=1, 2, 3, 4) and stored therein. For example, the processor p(2, 2) manages the range a(65:128, 65:128) of the array a. In FIG. 2, the shaded portion around the local area of the processor p(2, 2) indicates the overlap area of p(2, 2).

The processor p(2, 2) holds the larger area a(64:129, 64:129), which includes the overlap area, so that, when a(i, j) is calculated, the adjacent elements a(i, j+1), a(i, j−1), a(i+1, j), and a(i−1, j) can be accessed locally.
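The index ranges quoted above follow from an equal block distribution, as the following Python sketch suggests. The function names local_range and with_overlap are hypothetical, and an equal block size in each dimension is assumed.

```python
def local_range(x, procs, n):
    """1-based inclusive index range owned by processor x (1..procs)
    when n elements are split into equal blocks (assumed distribution)."""
    block = n // procs
    return (x - 1) * block + 1, x * block


def with_overlap(rng, width, n):
    """Widen a range by `width` elements on each side, clipped to 1..n."""
    lo, hi = rng
    return max(1, lo - width), min(n, hi + width)
```

For x=2 with 4 processors and 256 elements per dimension, local_range gives (65, 128), and widening by an overlap width of 1 gives (64, 129), in agreement with the ranges stated for p(2, 2).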

Without an overlap area, data must be read from adjacent processors inside the DO loop, so that small volumes of data are communicated frequently, resulting in large communications overheads. With an overlap area, however, the latest data can be copied to the overlap area by transferring the data collectively before the updating process. Therefore, data can be updated locally and the communications overheads can be considerably reduced.

The overlap area can be specified explicitly in VPP Fortran (“Realization and Evaluation of VPP Fortran Process System for AP1000” Vol. 93-HPC-48-2, pp. 9-16, Aug. 1993, published at SWOPP Tomonoura '93 HPC Conference by Tatsuya Sindoh, Hidetoshi Iwashita, Doi, and Jun-ichi Ogiwara). Some compilers automatically generate an overlap area as a form of optimization.

The data transmission patterns for performing parallel processes can be classified into two types. One is a single direction data transfer (SDDT), and the other is a bi-directional data transfer (BDDT). FIG. 3 shows an example of the SDDT, and FIG. 4 shows an example of the BDDT.

In FIGS. 3 and 4, processors i−1, i, i+1, and i+2 are arranged in a specified dimension and form a processor array. The SDDT is a transfer method in which all transfer data are transferred in a single direction, from the processor i toward the processor i+1, in the specified dimension. The BDDT is a transfer method in which data is transferred between adjacent processors in two directions. Thus, some pieces of data are transmitted from the processor i to the processor i+1 while other pieces of data are transmitted from the processor i+1 to the processor i.

FIG. 5 shows the program code of the Jacobi relaxation for a one-dimensional array. In the Jacobi code shown in FIG. 5, the value of the element a(i) of the one-dimensional array a is updated by the output of a function f whose inputs are the two adjacent elements a(i−1) and a(i+1). The size of the array a is specified by a(28), and a(1) and a(28) are not updated. The data is updated repeatedly for the number of times specified by time.
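As FIG. 5 is not reproduced here, the one-dimensional update can be sketched in Python as follows. The body of f (an average) is an assumption, since the text does not define f, and the list is 0-based, so its first and last entries play the roles of a(1) and a(28).

```python
def f(left, right):
    # Assumed update rule; the text leaves the function f unspecified.
    return (left + right) / 2.0


def jacobi_1d(a, time):
    """Update every interior element from its two neighbours, `time` times.
    The first and last elements are never updated."""
    for _ in range(time):
        a = [a[0]] + [f(a[i - 1], a[i + 1]) for i in range(1, len(a) - 1)] + [a[-1]]
    return a
```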

FIG. 6 shows an example in which data is updated using the conventional overlap area when the program shown in FIG. 5 is executed. In FIG. 6, PE0, PE1, PE2, and PE3 are four PEs that divide and manage the array a. Each PE has an area for storing 9 array elements. A dirty overlap area stores old data, and a clean overlap area stores the same latest data as the adjacent PE. A local area stores the data to be processed by each PE.

The word “INIT” indicates an initial state and “Update” indicates the data communications between adjacent PEs to update the overlap area. Iter 1, 2, 3, and 4 indicate parallel processes for the update of data at each iteration of the DO loop. In FIG. 6, the overlap area is updated by the BDDT at each iteration.

However, the data update method using the conventional overlap area has the following problems.

Each processor forming part of the parallel processor should update the data in the overlap area to the latest values before making a calculation that uses the data values of the overlap area. The update process is performed by reading the latest values from the adjacent processor through communications between processors. In parallel processors, the startup (rise) time of each communication carries heavy overheads. Therefore, the time required for the communications process depends on the number of data transfers rather than on the amount of transferred data. If an overlap area is updated each time a calculation is made using the overlap area, then the startup overheads are incurred for every such communication.

In a parallel processor connected through a torus network such as an AP1000 (“An Architecture of Highly Parallel Computer AP1000,” by H. Ishihata, T. Horie, T. Shimizu, and S. Kato, in Proc. IEEE Pacific Rim Conf. on Communications, Computers, and Signal Processing, pp. 13-16, May 1991), the SDDT is superior to the BDDT because the SDDT reduces the number of data transfers and the overheads required in a synchronization process between adjacent processors more than the BDDT does. However, in the conventional data update process as shown in FIG. 6, the data in the overlap areas should be exchanged between adjacent processors, and the data transfer pattern is based on the BDDT. In the BDDT, each processor should perform communications in synchronism with adjacent processors. As a result, the number of data transfers increases and the overheads for the synchronization processes become heavier than with the SDDT.

SUMMARY OF THE INVENTION

The present invention aims at providing a method of updating data with reduced overheads for the communications between PEs in a distributed-memory parallel processor, and at providing a program converting device for generating a data updating program.

The program converting device according to the present invention is provided in an information processing device, and converts an input program into a program for a parallel processor. The program converting device is provided with a detecting unit, a setting unit, a size determining unit, and a communications change unit.

The detecting unit detects, in the input program, a portion including the description of a loop where optimization can be realized using an overlap area. The setting unit assigns an overlap area to the memory of the PE that processes the description of the loop, generates a program code for calculating the data in the area, and then adds it to the initial program. Thus, when the converted program is executed by the parallel processor, each PE updates the data in the local area managed by that PE, and also updates the data in the overlap area, which is managed by other PEs. An overlap area updated by a calculation closed within each PE requires no data transfer for the update, thereby improving the efficiency of the parallel process.

The size determining unit estimates the runtime for the description of the loop and determines the optimum size of the overlap area. Normally, the larger the overlap area is, the fewer times data is transferred, but the longer it takes to update the data in the area. If the size of the overlap area is set such that the runtime is the shortest possible, the data update process can be performed efficiently.

The communications change unit checks the data dependency at the detected portion of the description of the loop. If the data is dependent bi-directionally, the description is rewritten such that the data is dependent in a single direction, and array subscripts are generated in the arrangement optimum for data transfer. Thus, each PE only has to communicate with the adjacent PE corresponding to either the upper limit or the lower limit of the subscripts in the array, thereby successfully reducing the overheads of the communications.

Conventionally, the overlap area has been updated using data transferred from outside the PE. According to the present invention, however, it is updated by a calculation process within each PE, thereby reducing the overheads for the communications and performing the parallel process at a high speed.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows the Jacobi code in the two-dimensional array;

FIG. 2 shows the conventional overlap area;

FIG. 3 shows the data transfer in a single direction;

FIG. 4 shows the data transfer in two directions;

FIG. 5 shows the Jacobi code in a one-dimensional array;

FIG. 6 shows the update of data using the conventional overlap area;

FIG. 7 shows the configuration of the program converting device according to the embodiment of the present invention;

FIG. 8 shows the configuration of the parallel processor according to the embodiment of the present invention;

FIG. 9 shows the configuration of the host computer;

FIG. 10 shows the update of data using an extended overlap area;

FIG. 11 is an operating flowchart showing the extended overlap area setting process;

FIG. 12 is an operating flowchart showing the extended overlap area available portion detecting process;

FIG. 13 shows a two-dimensional extended overlap area;

FIG. 14 shows the relationship between the extended parameter and the runtime;

FIG. 15 is an operating flowchart showing the extended overlap area assigning process;

FIG. 16 shows the update of data through the data transfer in a single direction;

FIG. 17 is an operating flowchart showing the data update setting process through the data transfer in a single direction;

FIG. 18 shows the original program;

FIG. 19 shows the program after converting the calculation space;

FIG. 20 shows the program with uncalculated elements added;

FIG. 21 shows the program after converting indices;

FIG. 22 shows the distance vector for the original program;

FIG. 23 shows the distance vector after the conversion;

FIG. 24 is an operating flowchart showing the setting process;

FIG. 25 shows the update of data using an extended overlap area and data transfer in a single direction.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The embodiments of the present invention are described in detail by referring to the attached drawings.

FIG. 7 shows the configuration of the program converting device according to the embodiment of the present invention. The program converting device shown in FIG. 7 is provided in an information processing device and converts an input program into a program for a parallel processor comprising a plurality of PEs and a communications network. The program converting device comprises a detecting unit 1, a setting unit 2, a size determining unit 3, and a communications change unit 4.

The detecting unit 1 detects, from the input program, a loop that can possibly be optimized using an overlap area.

The setting unit 2 assigns an overlap area to a PE for processing the loop, generates a code for calculating the data in the overlap area, and outputs a converted program.

The size determining unit 3 estimates the runtime for processing the loop and determines the optimum size of the overlap area. The setting unit 2 assigns the overlap area having the size determined by the size determining unit 3 to the PE for processing the loop.

The communications change unit 4 changes the data dependency of the process of the loop from bi-directional dependency to uni-directional dependency, and generates the subscripts of the array through which data is transferred uni-directionally (SDDT). The setting unit 2 generates a code for updating data through the uni-directional data transfer and adds the generated code to the converted program.

The detecting unit 1, setting unit 2, size determining unit 3, and communications change unit 4 shown in FIG. 7 correspond to a host computer 11 shown in FIGS. 8 and 9, and more specifically to the functions of a compiler 25 shown in FIG. 9. The compiler 25 is activated by a central processing unit (CPU) 22.

The detecting unit 1 scans an input program and detects a loop to which an overlap area can be applied. For example, a portion where a calculation process that can be performed in parallel by a plurality of PEs is enclosed by a serial DO loop, as shown in FIGS. 1 and 5, is detected.

The setting unit 2 assigns an overlap area to each PE sharing the process for the detected loop, and generates a code for calculating and updating the data in the overlap area. Thus, at the runtime of the converted program, each PE locally calculates the data in the overlap area as well as the data in the local area. Therefore, the number of communications in which data in the overlap area is updated can be reduced, thereby also reducing the communications overheads.

The size determining unit 3 estimates the runtime for processing the loop, and determines, for example, the optimum size of the overlap area. If the overlap area of the optimum size is assigned to each PE, the throughput of the parallel processor can be considerably improved.

The setting unit 2 generates a code for updating data through the SDDT, not through the BDDT. As a result, each PE does not have to communicate with both adjacent PEs, at the upper limit and the lower limit of the subscript in the array. Therefore, the synchronizing process is not required for the PE with which it does not communicate, which saves the overheads for the synchronization.

The communications change unit 4 checks the data dependence vector of the array data used in the above described loop, converts the data transfer from bi-directional transfer into uni-directional transfer, and generates subscripts in the array to perform the SDDT. The setting unit 2 generates a code for updating data using the subscripts in the converted array.

Thus, a program for use in a parallel processor is generated by the program converting device shown in FIG. 7. As a result, the overheads for the communications can be reduced.

FIG. 8 shows the configuration of the parallel processor for realizing the present invention. The parallel processor shown in FIG. 8 comprises the host computer 11, a network 12, and a processor array 13 comprising a plurality of PEs (PE0, PE1, . . . , PEn). The host computer 11 controls the entire parallel processor and performs input and output processes. The processor array 13 performs a parallel process. The host computer 11 is connected to each PE through the network 12.

FIG. 9 shows the configuration of the host computer 11. The host computer 11 shown in FIG. 9 comprises an input/output unit 21, the CPU 22, and a memory 23. These units are connected via an internal bus 24. In the host computer 11, the CPU 22 converts (compiles) a single program 26 into a program to be executed in parallel by each PE using the compiler 25 stored in the memory 23. The compiled program code is downloaded to each PE through the input/output unit 21 and is executed by each PE. The network 12 is used in downloading a program code or transferring data between processors at runtime.

Two methods are used for the embodiments of the present invention. One is to use an extended overlap area, and the other is to update data through the SDDT of the overlap area.

An extended overlap area is described by referring to FIGS. 10 through 15.

An extended overlap area is a data storage area whose width is a multiple of that of the conventional overlap area. Using an extended overlap area reduces the total number of communications performed to update the overlap area when a parallel loop that effectively uses the overlap area is repeated a plurality of times in a program. In the program according to the Jacobi relaxation shown in FIG. 1, data can be updated e times after one data transfer if the width of the overlap area shown in FIG. 2 is extended e times.

FIG. 10 shows the update of data when the Jacobi code shown in FIG. 5 is executed using the extended overlap area. In FIG. 10, the conventional overlap areas of PE0, PE1, PE2, and PE3 shown in FIG. 6 are extended three times. These PEs perform parallel processes using the extended overlap area. The meanings of the dirty overlap area, clean overlap area, local area, INIT, Update, and Iter 1, 2, 3, and 4 are the same as those in FIG. 6. By locally calculating a part of the data in the extended overlap area shown in FIG. 10, data needs to be transferred only once for every three data updates.

For example, PE1 updates the elements in the range a(9:16) of the array a, holds these elements in its local area, and has storage areas for the elements in the ranges a(6:8) and a(17:19) as an extended overlap area (INIT). PE1 first establishes communications with PE0 and PE2 and updates the data in the extended overlap area to the latest data (Update).

Then, the first calculation is made using the data in the range a(6:19) to update the data a(9:16) in the local area and the data a(7:8) and a(17:18) in the extended overlap area (Iter 1). Then, the second calculation is made using the data in the range a(7:18) to update the data a(9:16) in the local area and the data a(8) and a(17) in the extended overlap area (Iter 2). Then, the third calculation is made using the data in the range a(8:17) to update only the data a(9:16) (Iter 3).

Since all data in the extended overlap area are now dirty, that is, unusable, PE1 again establishes communications with the adjacent PEs to update the extended overlap area, and similarly repeats the data updating process from Iter 4 onward.
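The three local iterations of PE1 described above can be traced with the following Python sketch, which checks that PE1's copy of a(6:19) alone is enough to reproduce three global sweeps on a(9:16). The body of f and the helper names serial_step and pe1_three_iterations are assumptions for illustration; arrays are 1-based lists with an unused entry at index 0.

```python
N = 28  # size of the array a(28) in the FIG. 5 example


def f(left, right):
    # Assumed update rule; the text does not define f.
    return (left + right) / 2.0


def serial_step(a):
    """One global sweep over a[1..N]; a[1] and a[N] stay fixed."""
    new = a[:]
    for i in range(2, N):
        new[i] = f(a[i - 1], a[i + 1])
    return new


def pe1_three_iterations(a):
    """Update a(9:16) three times using only PE1's copy of a(6:19)."""
    buf = {i: a[i] for i in range(6, 20)}       # state right after "Update"
    for lo, hi in [(7, 18), (8, 17), (9, 16)]:  # ranges of Iter 1, 2, 3
        # The comprehension reads only old values before buf is modified.
        buf.update({i: f(buf[i - 1], buf[i + 1]) for i in range(lo, hi + 1)})
    return [buf[i] for i in range(9, 17)]
```

The updated range shrinks by one element on each side per iteration because the outermost clean values become stale once their own neighbours are no longer available locally.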

In FIG. 10, each element a(i) before the update of data and each element a(i) after the update of data are stored in physically different memory areas. That is, in each PE, the local area and the extended overlap area each hold both the data before the update and the data after the update. The memory area is automatically segmented in this way at the compilation of the program 26.

When the conventional overlap area is compared with the extended overlap area of the present invention, the total quantity of data transferred for the update of the overlap area remains the same. Since the communications overhead time depends more heavily on the number of transfers than on the total quantity of the transferred data, the communications overheads can be reduced more efficiently by using the extended overlap area.

Since each PE must calculate the part of the extended overlap area referred to by the subsequent iteration, this calculation duplicates the data update performed in the adjacent PE. Therefore, the larger the extended overlap area is, the more the parallelism of the processes is impaired. As an extreme example, if an extended overlap area is set such that a single PE stores the data for all PEs, no communications are required, but no parallelism is obtained. To determine the optimum size of the extended overlap area, the communications overheads and the process parallelism should be traded off.

FIG. 11 is an operating flowchart showing the extended overlap area setting process performed by the host computer 11 to realize the extended overlap area. The extended overlap area setting process shown in FIG. 11 is performed as a part of the optimizing process by the compiler 25. The inputs of the optimizing process are the program 26 (serial program) and an instruction (directive) to start the process, while the output is a program to be executed by each PE. After the program 26 is converted in the optimizing process into the program for each PE, the program and necessary data are downloaded to each PE to perform a parallel process.

When the process starts according to FIG. 11, the host computer 11 detects a point in the program 26 to which the extended overlap area is applicable (step S1). That is, a portion where a parallel loop suitable for the optimization using an overlap area is enclosed by a serial loop is detected as a portion to be effectively processed using the extended overlap area. The parallel loop refers to a loop process that can be processed in parallel by a plurality of PEs, as in the operations described in lines 6 and 7 shown in FIG. 1. The serial loop refers to a loop process repeatedly performed in series, such as the DO loop described in lines 5 and 8 shown in FIG. 1.

FIG. 12 is an operating flowchart showing an example of the process in step S1 shown in FIG. 11. When the process starts according to FIG. 12, the host computer 11 selects a DO loop contained in the program 26 (step S1-1) and checks whether or not the loop can be processed in parallel (step S1-2). In the case shown in FIG. 1, the parallel loop corresponds to the selected DO loop. If it can be processed in parallel, then it is determined whether or not it is a DO loop for which an overlap area is effective (step S1-3). A DO loop for which an overlap area is effective refers to a process that can be performed locally by each PE if an overlap area is provided. If the overlap area is effective, then it is determined whether or not the DO loop (parallel loop) is tightly enclosed by a serial loop (step S1-4). If the DO loop is tightly enclosed, the selected DO loop is entered in an appropriate area in the memory 23 as a portion to which the extended overlap area is applicable (step S1-5).

Then, it is checked whether or not any other DO loops exist (step S1-6). If yes, the processes in and after step S1-1 are repeated. If not, the process terminates. If the determination result is “No” in step S1-2, S1-3, or S1-4, control is passed to the process in step S1-6.
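The control flow of steps S1-1 through S1-6 can be sketched in Python as follows. The DoLoop record and its boolean fields are hypothetical stand-ins for the compiler's analyses, which the text does not specify; only the order of the checks is taken from the flowchart description.

```python
from dataclasses import dataclass


@dataclass
class DoLoop:
    # Hypothetical record; each flag stands for the result of one analysis.
    name: str
    parallelizable: bool          # checked in step S1-2
    overlap_effective: bool       # checked in step S1-3
    tightly_in_serial_loop: bool  # checked in step S1-4


def detect_applicable_loops(loops):
    """Return the names of loops entered as applicable portions."""
    applicable = []
    for loop in loops:                      # S1-1 select, S1-6 any others?
        if not loop.parallelizable:         # S1-2 "No" -> go to S1-6
            continue
        if not loop.overlap_effective:      # S1-3 "No" -> go to S1-6
            continue
        if not loop.tightly_in_serial_loop: # S1-4 "No" -> go to S1-6
            continue
        applicable.append(loop.name)        # S1-5 enter the loop
    return applicable
```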

As a result of the extended overlap area applicable portion detecting process shown in FIG. 12, for example, the parallel loops shown in FIGS. 1 and 5 are entered as applicable portions.

In FIG. 11, after performing the process in step S1, the host computer 11 generates an execution model using the extended overlap area and estimates the runtime of the program (step S2). Described below is the method of estimating the runtime, taking as an example the extended overlap area for a common two-dimensional array.

FIG. 13 shows an extended overlap area provided around the local area of a PE storing a two-dimensional array. In FIG. 13, l is the size (number of array elements in each dimension) of the local area, w is the size (width) of the conventional overlap area, e is a parameter showing by how much the conventional overlap area is extended, and ew is the size (width) of the extended overlap area.

Assuming that the calculation time for a unit area (1 element) in a local area or an extended overlap area is a, that the overhead time (prologue and epilogue time) taken for the activation and termination of one data transfer process is c, and that the time taken for a data transfer per unit area is d, the data transfer time is calculated by the expression c+d×(size of transferred area).

In the optimization using an extended overlap area, communications are established once for every e data updates (e iterations). The data to be communicated is the extended overlap area shown as the shaded portion in FIG. 13. The area (number of elements) of this portion is 4ew(ew+l). The communications are established 8 times, with the eight PEs that process the array elements in the upper, lower, right, left, and four diagonal directions. Therefore, the total communications time required for the e iterations is calculated by the following equation:

8c+4dew(ew+l)   (1)

The communications time per iteration is obtained by dividing equation (1) by e as follows. $\begin{matrix}\frac{8c + 4dew(ew + l)}{e} & (2)\end{matrix}$

Then, the time taken for the calculations for the update of data is estimated. Since the number of calculation elements in the local area is l², the calculation time required for e iterations of the calculation for the local area is ael². If the size (width) of the updated area in the extended overlap area is kw in a given iteration, the number of calculation elements in the extended overlap area is 4kw(kw+l), where k is a parameter representing how many times the width of the conventional overlap area the updated portion of the extended overlap area is in that iteration. If data is updated locally without communications, the extended overlap area becomes dirty sequentially from outside to inside with each iteration, and the width of the updated portion in the extended overlap area decreases by w each time. Accordingly, the calculation time taken for calculating the extended overlap area during the e iterations is obtained by the following equation: $\begin{matrix}4a\sum\limits_{k = 1}^{e - 1} kw(kw + l) & (3)\end{matrix}$

Since the calculation time taken for calculating the extended overlap area at the (e−k)-th iteration is 4akw(kw+l), summing over k from 1 to e−1 gives equation (3). The calculation time per iteration is obtained by adding the calculation time for the local area for e iterations to the calculation time for the extended overlap area and then dividing the sum by e as follows. $\begin{matrix}\frac{4a\sum\limits_{k = 1}^{e - 1} kw(kw + l) + ael^{2}}{e} & (4)\end{matrix}$

According to equations (2) and (4), the runtime T_(iter)(e) for one iteration of the serial loop using the extended overlap area is represented as a function of e as follows. $\begin{matrix}T_{iter}(e) = \frac{8c + 4dew(ew + l)}{e} + \frac{4a\sum\limits_{k = 1}^{e - 1} kw(kw + l) + ael^{2}}{e} = \frac{8c}{e} + \left(4dw^{2} - 2w^{2}a + 2alw\right)e + \frac{4w^{2}a}{3}e^{2} + \left(4dlw + al^{2} + \frac{2}{3}w^{2}a - 2alw\right) & (5)\end{matrix}$
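As a check on the expansion in equation (5), the following Python sketch evaluates both the summation form (equations (2) and (4)) and the expanded polynomial form with exact rational arithmetic. The function names are hypothetical, and integer parameter values are assumed so that the two forms can be compared exactly.

```python
from fractions import Fraction


def t_iter_sum(e, a, c, d, l, w):
    """Per-iteration runtime from equations (2) and (4); integer inputs assumed."""
    comm = 8 * c + 4 * d * e * w * (e * w + l)                        # equation (1)
    overlap = 4 * a * sum(k * w * (k * w + l) for k in range(1, e))  # equation (3)
    local = a * e * l ** 2                                           # local area, e iterations
    return Fraction(comm + overlap + local, e)


def t_iter_closed(e, a, c, d, l, w):
    """Expanded form on the right-hand side of equation (5)."""
    s = Fraction(4 * w * w * a, 3)                    # coefficient of e^2
    t = 4 * d * w * w - 2 * w * w * a + 2 * a * l * w # coefficient of e
    u = 8 * c                                         # coefficient of 1/e
    v = 4 * d * l * w + a * l * l + Fraction(2 * w * w * a, 3) - 2 * a * l * w
    return s * e * e + t * e + Fraction(u, e) + v
```

The two functions agree for every positive integer e, which follows from the standard formulas for the sums of k and k² from 1 to e−1.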

Once the runtime is estimated, the host computer 11 determines the optimum size of the extended overlap area according to the estimate result (step S3). The optimum size of the extended overlap area refers to the size giving the shortest possible runtime.

For example, assuming that, in equation (5), the coefficient of the e² term is s, the coefficient of the e term is t, the coefficient of the 1/e term is u, and the term of order 0 in e (the constant term) is v, then equation (5) is rewritten as follows.

T_(iter)(e)=se²+te+u/e+v   (6)

FIG. 14 is a graph showing the relationship between T_(iter)(e) and e in equation (6). The value e_(0) of e corresponding to the minimum value T_(iter)(e_(0)) of T_(iter)(e) shown in FIG. 14 can be obtained by solving the following equation for e. $\begin{matrix}\frac{dT_{iter}(e)}{de} = 0 & (7)\end{matrix}$

The obtained e_(0) is the value of the extension parameter that optimizes the size of the extended overlap area. The size of the extended overlap area is given by e_(0)w.
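Since e is in practice a positive integer, one simple way to approximate the solution of equation (7) is to evaluate equation (6) over a range of candidate values and keep the smallest. The following Python sketch does this; the function names and the search bound e_max are assumptions for illustration, not part of the described method.

```python
def t_iter(e, s, t, u, v):
    """Equation (6): runtime per iteration as a function of e."""
    return s * e * e + t * e + u / e + v


def optimal_extension(s, t, u, v, e_max=64):
    """Integer e in 1..e_max that minimizes equation (6).
    e_max is a hypothetical search bound."""
    return min(range(1, e_max + 1), key=lambda e: t_iter(e, s, t, u, v))
```

For instance, with s=1, t=0, u=16, and v=0, equation (7) becomes 2e − 16/e² = 0, whose solution e=2 is also the value returned by the search.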

Once the optimum size of the extended overlap area is determined, the host computer 11 assigns the extended overlap area of that size to each PE (step S4).

FIG. 15 is an operating flowchart showing an example of the extended overlap area assigning process performed in step S4 shown in FIG. 11. When the process starts according to FIG. 15, the host computer 11 first adds the optimum size of the extended overlap area to the data size (original data size) of the local area of each PE to obtain a new data size (step S4-1). The optimum size of the extended overlap area is obtained as the product of the optimum extension parameter obtained in step S3 and the width of the conventional overlap area. According to the example shown in FIG. 13, the optimum size is e_(0)w. Then, the data is declared again with the new data size (step S4-2), and the process terminates.

Thus, the extended overlap area of each PE is allocated storage, of the size of the extended overlap area, for data belonging to the local areas of other PEs.

When the process in step S4 is completed, the host computer 11 inserts a program code for calculating the extended overlap area (step S5). That is, a code is generated such that the range processed by each PE in the parallel loop is extended by the size of the extended overlap area. The generated code is put into the program.

For example, the range of the indices in the array managed by PE1 shown in FIG. 10 is originally 9-16. If a code is generated for the range 7-18 in Iter 1, new values for indices 7, 8, 17, and 18 are obtained from the values in the extended overlap area. Likewise, the calculation for the range 8-17 in Iter 2 can be made. Thus, the new values for indices 8 and 17 can be obtained using the values in the extended overlap area obtained in Iter 1.

Then, a program code is inserted to update the extended overlap area (step S6). In this process, a code is generated such that communications are established to update the data in the extended overlap area each time the serial loop enclosing the parallel loop has been repeated the number of times indicated by the extension parameter. Then, the code is put into the program of each PE.

For example, communications are established once every three iterations of the serial loop in the example shown in FIG. 10. In the example shown in FIG. 13, communications are established once every e_(0) iterations of the serial loop. After step S6, the host computer 11 terminates the process.
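The combined effect of steps S5 and S6 can be illustrated with the following Python simulation for the one-dimensional example, which counts neighbour transfers and compares the result with a purely serial computation. It is only a sketch: the body of f, the function names, the block layout, and the use of a shared global list in place of real message passing are all assumptions. Arrays are 1-based lists with an unused entry at index 0, and the conventional overlap width w is taken to be 1.

```python
def f(left, right):
    return (left + right) / 2.0  # assumed update rule


def serial(a, iters):
    n = len(a) - 1
    for _ in range(iters):
        a = a[:2] + [f(a[i - 1], a[i + 1]) for i in range(2, n)] + a[n:]
    return a


def parallel(a, iters, e, blocks):
    """Simulate PEs with overlap width e; return (result, transfers)."""
    n = len(a) - 1
    transfers = 0
    for start in range(0, iters, e):
        bufs = []
        for lo, hi in blocks:  # "Update": refresh every extended overlap area
            bufs.append({i: a[i] for i in range(max(1, lo - e), min(n, hi + e) + 1)})
            transfers += (lo > 1) + (hi < n)  # one message per neighbour
        for k in range(min(e, iters - start)):  # local iterations, no messages
            margin = e - 1 - k  # overlap width still updated in this iteration
            for (lo, hi), buf in zip(blocks, bufs):
                first, last = max(2, lo - margin), min(n - 1, hi + margin)
                buf.update({i: f(buf[i - 1], buf[i + 1])
                            for i in range(first, last + 1)})
        a = a[:]
        for (lo, hi), buf in zip(blocks, bufs):  # collect the local areas
            for i in range(lo, hi + 1):
                a[i] = buf[i]
    return a, transfers
```

With four blocks and six iterations, an extension parameter of 3 yields the same values as the conventional width of 1 while using one third of the transfers.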

The update of the overlap area by the SDDT is described below by referring to FIGS. 16 through 23. Even if the overlap areas are provided on both sides of the local area as shown in FIG. 6, the communications can be converted into uni-directional communications by shifting the data layout in the communications direction between the PEs each time the data is updated.

FIG. 16 shows an example of updating data by the SDDT when the program shown in FIG. 5 is executed. In FIG. 16, the meanings of the dirty overlap area, clean overlap area, local area, INIT, Update, and Iter 1, 2, 3, . . . are the same as those shown in FIG. 6.

In the system using the SDDT, the storage position of the data is shifted in one direction, in a torus form, at each iteration. To convert the conventional system, in which an overlap area is updated by the BDDT, into the system using the SDDT, for one of the two communication directions each PE sends the data required to obtain a new value to the adjacent PE, instead of receiving the data required to calculate the new value from that adjacent PE. As a result, the communications are established uni-directionally and the overlap area needs to be provided on only one side of the local area.

For example, at the initial state, the PE1 holds the elements in the range of a (9:16) in the array a, and has the storage area for the elements in the range of a (7:8) as an overlap area (INIT). Then, the PE1 receives the data from the PE0, updates the data in the overlap area, and transmits the data in the range of a (15:16) to the PE2 (Update).

Then, the PE1 makes the first calculation using the data in the range of a (7:16), updates the data in a (8:15) (Iter 1), and stores the data after shifting the storage position in the communications direction by 1. At this time, the data in a (16) initially stored by the PE1 is updated in parallel by the PE2. Since the data in the overlap area have all become dirty, the PE1 establishes uni-directional communications between the PE1 and the adjacent PE, updates the extended overlap area, and performs the data update process for Iter 2.

Since repeating these processes sequentially shifts the storage positions of all data over the PE0 through PE3 in a torus form, after the data update process for Iter 4, the data in a (27:28) of the PE3 is transferred to the overlap area of the PE0.

The result of the data updated by the SDDT shown in FIG. 16 matches the result conventionally updated by the BDDT. Between the BDDT and the SDDT, the total volume of transferred data is the same, but the transfer time with the SDDT can be reduced to half the transfer time with the BDDT. Therefore, by using the SDDT, the overheads required to activate the data transfer and the overheads required in the synchronization process between adjacent PEs can be reduced.
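The shifting data layout can be checked with a short simulation. This is a sketch under assumptions rather than the program of the figures: a three-point averaging update is assumed, the array is treated as periodic (so that the torus transfer from the last PE to the first has a defined meaning), and the names `periodic_reference` and `sddt_update` are hypothetical. Each simulated PE receives two elements from its left neighbour only, computes, and stores the result shifted by one position.

```python
import numpy as np

def periodic_reference(a, iters):
    """Reference: assumed 3-point update on a periodic array, no distribution."""
    a = a.copy()
    for _ in range(iters):
        a = 0.5 * (np.roll(a, 1) + np.roll(a, -1))
    return a

def sddt_update(a, n_pe, iters):
    """Update with a one-sided overlap area of two elements per PE."""
    chunk = len(a) // n_pe
    assert chunk * n_pe == len(a) and chunk >= 2
    local = [a[p * chunk:(p + 1) * chunk].copy() for p in range(n_pe)]
    messages = 0
    for _ in range(iters):
        # Uni-directional transfer: every PE receives from its left neighbour
        # (index -1 wraps around, giving the torus connection).
        overlap = [local[p - 1][-2:].copy() for p in range(n_pe)]
        messages += n_pe
        for p in range(n_pe):
            buf = np.concatenate([overlap[p], local[p]])
            # The new value of buffer position k uses positions k-1 and k+1.
            # Storing it at local position k-1 shifts the layout by one.
            local[p] = 0.5 * (buf[:-2] + buf[2:])
    shifted = np.concatenate(local)
    # Undo the accumulated shift so the result can be compared directly.
    return np.roll(shifted, -iters), messages

a0 = np.cos(np.arange(32.0))
result, n_messages = sddt_update(a0, n_pe=4, iters=5)
```

In this simulation each PE receives one message of two elements per iteration, whereas a bi-directional scheme for the same update would receive two messages of one element each, which is consistent with the statement that the transferred volume is unchanged.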

FIG. 17 is an operating flowchart showing the setting process in updating the data by the SDDT. The setting process shown in FIG. 17 can be performed by the host computer 11 as a part of the optimizing process by the compiler 25.

When the process starts as shown in FIG. 17, the host computer 11 detects the point of data update by the SDDT contained in the program 26 (step S11). That is, a point where a parallel loop applicable for the optimization using the overlap areas is encompassed by a serial loop, and where the overlap areas are provided on both sides (at the upper and lower limits) of the local area of each PE, is detected as a point where the update by the SDDT works effectively.

Then, a computational transformation is made (step S12). In this process, the position of the data calculated according to the count of the outer serial loop is shifted and the SDDT is used in updating the overlap areas.

In the computational transformation, a loop nest for determining a computational space is converted such that all data-dependency vectors become non-negative in the direction along the axis of the processor array 13. For the loop where the data dependency is represented by a distance vector, the computational space conversion can be performed as an application of unimodular transformation (M. E. Wolf and M. S. Lam, “A loop transformation theory and an algorithm to maximize parallelism,” in IEEE Transactions on Parallel and Distributed Systems, pp. 452-471, Oct. 1991). In this case, the transform matrix T can be represented as follows, with the dimension of the array set to m and the parameter of the skew in each dimension set to a₁, a₂, . . . , a_(m).

$\begin{matrix}{T = \begin{bmatrix}1 & 0 & 0 & \ldots & 0 \\a_{1} & 1 & 0 & \ldots & 0 \\a_{2} & 0 & 1 & \ldots & 0 \\\vdots & \vdots & \vdots & \ddots & \vdots \\a_{m} & 0 & 0 & \ldots & 1\end{bmatrix}} & (8)\end{matrix}$

The skew vector S containing the parameters a₁, a₂, . . . , a_(m) of equation (8) can be defined as follows.

$\begin{matrix}{S = \begin{bmatrix}a_{1} \\a_{2} \\\vdots \\a_{m}\end{bmatrix}} & (9)\end{matrix}$

FIG. 18 shows the program (original program) rewritten from the parallel loop of the program shown in FIG. 5 in the FORALL syntax. FIG. 22 shows the distance vector (i, j) representing the data dependency of the original program. In FIG. 22, i indicates a time axis and corresponds to the serial loop repetition parameter, and j indicates a space axis over the PEs and is mapped along the memory space axis corresponding to the PE-connection direction (PE arrangement direction). The “j” corresponds to the parameter of the FORALL syntax shown in FIG. 18.

In the program shown in FIG. 18, the communications for the update of the overlap area are the BDDT because the distance vectors are bi-directional over a plurality of PEs if the j axis is mapped onto the PE array. When the set of distance vectors representing the data dependency of the program is denoted by D, the following equation holds.

D={(1,−1), (1,1)}   (10)

Next, a conversion matrix T in which the time axis is removed to make the program loop nest fully permutable is obtained. Since the array is one-dimensional, the conversion matrix T in equation (8) forms a 2×2 matrix (2 rows by 2 columns) and the following equation holds.

$\begin{matrix}{T = \begin{bmatrix}1 & 0 \\a_{1} & 1\end{bmatrix}} & (11)\end{matrix}$

With T set as shown above, the distance vectors (1, −1) and (1, 1) are converted as follows.

$\begin{matrix}{{\begin{bmatrix}1 & 0 \\a_{1} & 1\end{bmatrix}\begin{bmatrix}1 \\{- 1}\end{bmatrix}} = \begin{bmatrix}1 \\{a_{1} - 1}\end{bmatrix}} & (12) \\{{\begin{bmatrix}1 & 0 \\a_{1} & 1\end{bmatrix}\begin{bmatrix}1 \\1\end{bmatrix}} = \begin{bmatrix}1 \\{a_{1} + 1}\end{bmatrix}} & (13)\end{matrix}$

Equation (12) indicates that the distance vector (1, −1) is converted by T into the distance vector (1, a₁−1). Equation (13) indicates that the distance vector (1, 1) is converted by T into the distance vector (1, a₁+1).

To make the loop nest fully permutable, both components a₁−1 and a₁+1 of the converted distance vectors should be equal to or larger than 0. This condition is represented by the following equation.

a₁≧1   (14)

The minimum value of a₁ satisfying the condition to make the loop nest permutable is 1. In this case, the “T” in equation (11) is represented as follows.

$\begin{matrix}{T = \begin{bmatrix}1 & 0 \\1 & 1\end{bmatrix}} & (15)\end{matrix}$

The converted distance vectors obtained from equations (12) and (13) are (1, 0) and (1, 2) respectively, and the skew vector S (a scalar in this case) is S=a₁=1 by equation (9).
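The derivation of equations (11) through (15) can be reproduced numerically. The following is a minimal sketch for the one-dimensional case only; the function names `min_skew` and `skew` are hypothetical, and restricting a₁ to non-negative values is an assumption made here for illustration.

```python
import numpy as np

def min_skew(distance_vectors):
    """Smallest non-negative a1 such that a1*di + dj >= 0 for every (di, dj)."""
    a1 = 0
    for di, dj in distance_vectors:
        assert di > 0, "the time component of a distance vector must be positive"
        a1 = max(a1, -(dj // di))  # integer ceiling of -dj/di
    return a1

def skew(distance_vectors, a1):
    """Apply the 2x2 matrix T of equation (11) to each distance vector."""
    T = np.array([[1, 0], [a1, 1]])
    return [tuple(int(x) for x in T @ np.array(d)) for d in distance_vectors]

D = [(1, -1), (1, 1)]        # distance vectors of the original program
a1 = min_skew(D)             # skew parameter, the scalar S in this case
converted = skew(D, a1)      # distance vectors after the transformation
```

For the set D given above, this yields a₁=1 and the converted vectors (1, 0) and (1, 2), in agreement with equations (12) through (15).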

FIG. 23 shows the distance vectors obtained by applying the conversion matrix T of equation (15) to the distance vectors shown in FIG. 22. In FIG. 23, the component j of all distance vectors is non-negative, and the data over all PEs depends uni-directionally.

If the “T” in equation (15) is applied to (i, j) of the program shown in FIG. 18, the converted values (i′, j′) are obtained by the following equation.

$\begin{matrix}{\begin{bmatrix}i^{\prime} \\j^{\prime}\end{bmatrix} = {{\begin{bmatrix}1 & 0 \\1 & 1\end{bmatrix}\begin{bmatrix}i \\j\end{bmatrix}} = \begin{bmatrix}i \\{i + j}\end{bmatrix}}} & (16)\end{matrix}$

FIG. 19 shows the program obtained by rewriting the virtual array a of the program shown in FIG. 18 by the paradigm of a shared memory. The actual array corresponding to the virtual array a is distributed over plural PEs and statically assigned a memory area.

When the computational space transform is completed, the host computer 11 performs an index transformation (step S13). In this process, the data layout is shifted according to the changes in data dependency.

With the changes in data dependency, the data layout in the memory space of each PE should be aligned with the mapping for the calculation process. However, the static data layout declared in a data parallel language such as HPF cannot be aligned with a mapping in which the calculation position is shifted, which disables the SDDT. Therefore, the relationship between the virtual array and the actual array should be changed with time so that the data alignment to each PE can be shifted for each iteration of the serial loop.

Assuming that the subscript vector of the m-dimensional virtual array is I_(ν)=(I_(ν1), I_(ν2), . . . , I_(νm)) and the subscript vector of the corresponding actual array is I_(p)=(I_(p1), I_(p2), . . . , I_(pm)), the index transform from I_(ν) to I_(p) is represented as follows.

$\begin{matrix}{{I_{p} = {I_{v} + {tS}}},{I_{p} = \begin{bmatrix}I_{p1} \\I_{p2} \\\vdots \\I_{pm}\end{bmatrix}},{I_{v} = \begin{bmatrix}I_{v1} \\I_{v2} \\\vdots \\I_{vm}\end{bmatrix}}} & (17)\end{matrix}$

Here, the time step t is used for the subscript in the virtual array before update while the time step t+1 is used for the subscript in the virtual array after update. After performing such indexing processes, the storage positions of all elements in the actual array can be shifted in each time step. However, since the elements at the upper and lower limits of the virtual array are not calculated and are not updated unless new values are assigned, their storage positions should be shifted with their values kept. A code is inserted to ensure such consistency before applying the index conversion process.

FIG. 20 shows the program obtained by adding to the program shown in FIG. 19 a code to hold an uncalculated element value. In FIG. 20, the cases where j′=i′−1 and j′=i′+26 correspond to the processes of the elements at the upper and lower limits respectively, and a statement is inserted to keep these element values constant.

FIG. 21 shows the program obtained by applying the index conversion process to the program shown in FIG. 20. In FIG. 21, A shows an actual array corresponding to the virtual array a. At this time, the conversion of the parameter in the index conversion process is represented as follows.

J←j′+t   (18)

t←i′+1 for LHS (left side)   (19)

t←i′ for RHS (right side)   (20)

where the conversion by equation (18) replaces all j's in the program with J after converting j appearing in the equations shown in FIG. 20 into j′+t. The conversion by equation (19) substitutes i′+1 for t of the left part while the conversion by equation (20) substitutes i′ for t of the right part.

According to the latest program shown in FIG. 21, the positions of the elements at both ends corresponding to J=i′−1, i′+26 are shifted by 1 per time step with their values remaining unchanged. Other elements are updated using the values of the adjacent elements in each time step, and the positions are shifted by 1. Thus, the data can be updated using the SDDT as shown in FIG. 16.
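The time-dependent relationship I_(p)=I_(ν)+tS of equation (17) can be tried out on a single shared array. The following is a sketch under assumptions and does not reproduce the programs of FIGS. 20 and 21: a three-point averaging update with fixed end elements is assumed, the array bounds are arbitrary, and the names `reference_update` and `skewed_update` are hypothetical. The element with virtual subscript j is stored at actual subscript j+tS at time step t, with t taken as the old step on the right side and the new step on the left side.

```python
import numpy as np

def reference_update(a, iters):
    """Reference: assumed 3-point update with a static layout, end elements fixed."""
    a = a.copy()
    for _ in range(iters):
        new = a.copy()
        new[1:-1] = 0.5 * (a[:-2] + a[2:])
        a = new
    return a

def skewed_update(a, iters, S=1):
    """Same update, storing virtual element j at actual index j + t*S."""
    n = len(a)
    old = np.zeros(n + S * iters)   # room for the layout to drift
    old[:n] = a
    for t in range(iters):
        new = np.zeros_like(old)
        lo = S * (t + 1)            # actual index of the first element after the update
        hi = lo + n - 1             # actual index of the last element after the update
        # End elements are not calculated: move them with their values kept.
        new[lo] = old[lo - S]
        new[hi] = old[hi - S]
        # Interior elements: with S=1 this reads old[J-2] and old[J] only,
        # so every dependency points in one direction.
        J = np.arange(lo + 1, hi)
        new[J] = 0.5 * (old[J - S - 1] + old[J - S + 1])
        old = new
    # Restore the layout by undoing the accumulated shift.
    return old[S * iters:S * iters + n]

a0 = np.sin(np.arange(28.0))
result = skewed_update(a0, iters=5)
```

The final slice plays the role of a layout restoration: it returns each element to the position it would have had with a static layout, so the result can be compared with `reference_update`.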

After the index conversion process, the host computer 11 inserts a code for use in restoring the data layout (step S14), and terminates the process. To restore the data layout refers to a process of returning the storage position of each element shifted in the data update process using the SDDT to the initial position specified by the programmer.

For example, in the data update process shown in FIG. 16, since the storage position of the actual array is shifted by 1 to the right for each iteration of the serial loop, the position is shifted by i_(f) to the right of the initial position after i_(f) iterations. If it is returned to the original position by shifting all elements by i_(f) to the left after all serial loops are processed, then the influences of the shift can be ignored in the succeeding processes. Such restoration codes are further added to the program shown in FIG. 21.

In FIG. 16, the data transfer time in a single update process on the overlap area can be reduced by using the SDDT. However, since the overlap area is not updated through a calculation, the communications are established for each iteration of the serial loop to update the overlap area. If the above described extended overlap area is applied to the data update using the SDDT, the overhead required for the communications can be further reduced.

FIG. 24 is an operating flowchart showing the setting process in which the extended overlap area is set and data update is set using the SDDT. The setting process shown in FIG. 24 is also performed by the host computer 11 as a part of the optimization process by the compiler 25. When the process starts as shown in FIG. 24, the host computer 11 first performs the extended overlap area setting process (step S21) shown in FIG. 11, then performs the data update setting process (step S22) using the SDDT shown in FIG. 17, and terminates the process. Thus, the program is generated such that the data can be updated using the SDDT in the extended overlap area.

FIG. 25 shows an example of updating data using both the SDDT and the extended overlap area when the program shown in FIG. 5 is executed. In FIG. 25, the meanings of the dirty overlap area, clean overlap area, local area, INIT, Update, and Iter 1, 2, 3, . . . are the same as those shown in FIG. 6.

For example, at the initial state, the PE1 holds the elements in the range of a (9:16) in the array a in the local area, and has the storage area for the elements in the range of a (5:8) as an overlap area (INIT). Then, the PE1 receives the data from the PE0, updates the data in the overlap area, and transmits the data in the range of a (13:16) to the PE2 (Update).

Then, the PE1 makes the first calculation using the data in the range of a (5:16), updates the data in a (6:15) (Iter 1), and stores the data after shifting the storage position in the communications direction by 1. At this time, the data in a (16) initially stored by the PE1 is updated in parallel by the PE2. Then, the PE1 makes the second calculation using the data in the range of a (6:15), updates the data in a (7:14) (Iter 2), and stores the data after shifting the storage position in the communications direction by 1. At this time, the data in a (15) initially stored by the PE1 is updated in parallel by the PE2. Since the data in the extended overlap area have all become dirty, the PE1 establishes uni-directional communications between the PE1 and the adjacent PE, updates the extended overlap area, and performs the data update process for Iter 3.

According to the data update shown in FIG. 25, the data can be locally updated twice consecutively after the extended overlap area is updated through the communications. The total volume of the transferred data is the same as that in FIG. 16, but the overhead required for the communications can be reduced. In the example shown in FIG. 25, two overlap areas are added to the left of the overlap area of each PE shown in FIG. 16. In general, an extended overlap area provided with an even number of additional overlap areas can also be used.
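The combination of the SDDT and the extended overlap area can be sketched in the same manner. The following simulation is an illustration under assumptions, not the generated program: a three-point averaging update on a periodic array is assumed, and the names `periodic_reference`, `sddt_extended_update` and `steps_per_comm` are hypothetical. Each simulated PE receives 2×`steps_per_comm` elements from its left neighbour, then performs `steps_per_comm` iterations locally, each of which consumes two elements of the buffer and shifts the layout by one.

```python
import numpy as np

def periodic_reference(a, iters):
    """Reference: assumed 3-point update on a periodic array, no distribution."""
    a = a.copy()
    for _ in range(iters):
        a = 0.5 * (np.roll(a, 1) + np.roll(a, -1))
    return a

def sddt_extended_update(a, n_pe, iters, steps_per_comm):
    """One-sided overlap area of 2*steps_per_comm elements, refreshed periodically."""
    chunk = len(a) // n_pe
    width = 2 * steps_per_comm
    assert chunk * n_pe == len(a) and width <= chunk
    assert iters % steps_per_comm == 0
    local = [a[p * chunk:(p + 1) * chunk].copy() for p in range(n_pe)]
    messages = 0
    volume = 0
    for _round in range(iters // steps_per_comm):
        # Uni-directional transfer from the left neighbour (torus connection).
        overlap = [local[p - 1][-width:].copy() for p in range(n_pe)]
        messages += n_pe
        volume += n_pe * width
        for p in range(n_pe):
            buf = np.concatenate([overlap[p], local[p]])
            for _step in range(steps_per_comm):
                # Each local iteration shrinks the valid buffer by two elements.
                buf = 0.5 * (buf[:-2] + buf[2:])
            local[p] = buf  # exactly `chunk` elements remain
    restored = np.roll(np.concatenate(local), -iters)
    return restored, messages, volume

a0 = np.cos(np.arange(32.0))
result, n_messages, n_elements = sddt_extended_update(a0, 4, iters=6, steps_per_comm=2)
```

With `steps_per_comm=2` the buffer of the PE1 covers a (5:16), a (6:15) and a (7:14) in turn, matching the ranges described for FIG. 25. Comparing the counters against `steps_per_comm=1` shows the same number of transferred elements with half as many messages in this simulation.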

According to the present invention, the overhead of the synchronization accompanying the communications can be reduced when a parallel process is performed using an overlap area in a distributed-memory parallel processor, thereby realizing a high-speed parallel process.

What is claimed is:
 1. A program converting device for use in an information processing device for converting an input program into a program to be executed in a parallel processor comprising a plurality of processing elements and a communications network, comprising: a detecting unit to detect in the input program a loop portion in which optimization can be realized using an overlap area; and a setting unit to convert the input program by assigning an overlap area to a processing element processing the loop portion and generating a code based on which data in the overlap area is updated through calculations in multiple iterations using only the data in the processing element, and for outputting a converted program.
 2. The program converting device according to claim 1, wherein said detecting unit detects as said loop portion a parallel loop encompassed by a serial loop, said parallel loop being able to be optimized using the overlap area.
 3. The program converting device according to claim 1, wherein said setting unit generates a code such that data can be processed twice consecutively without communications after the processing element processing said loop portion updates the overlap area through communications with an adjacent processing element.
 4. The program converting device according to claim 1 further comprising: a size determining unit determining an optimum size of the overlap area after estimating process time for said loop portion, and wherein said setting unit assigns the overlap area of a size determined by said size determining unit to the processing element processing said loop portion.
 5. The program converting device according to claim 4, wherein said size determining unit determines the optimum size such that the process time for said loop portion becomes short.
 6. A program converting device for use in an information processing device for converting an input program into a program to be executed in a parallel processor comprising a plurality of processing elements and a communications network, comprising: a detecting unit to detect in the input program a loop portion in which optimization can be realized using an overlap area; and a setting unit to convert the input program by assigning an overlap area to a processing element processing the loop portion and generating a code based on which data is updated in the overlap area through single direction data transfer and data in the processing element is shifted in a direction of the data transfer, and for outputting a converted program.
 7. The program converting device according to claim 6, further comprising: a communications change unit to change data dependency in processing said loop portion from bi-directional transfer to single direction transfer and for generating a subscript of an array in which the single direction data transfer can be performed, and wherein said setting unit generates a code for use in updating the data using the subscript in said array generated by said communications change unit.
 8. The program converting device according to claim 7, wherein said communications change unit changes a direction of a distance vector representing the data dependency into single direction among processing elements and generates the subscript of the array using a changed distance vector.
 9. The program converting device according to claim 7, wherein said communications change unit generates the subscript of the array such that a data layout of the processing element processing said loop portion can be shifted in the direction of the data transfer each time data is processed, shifted data layout covering two processing elements.
 10. The program converting device according to claim 9, wherein said setting unit generates a code to return the shifted data layout after processing said loop portion to an original data layout.
 11. A program converting device for use in an information processing device for converting an input program into a program to be executed in a parallel processor comprising a plurality of processing elements and a communications network, comprising: a detecting unit detecting in the input program a loop portion in which optimization can be realized using an overlap area; and a setting unit converting the input program by assigning an overlap area to a processing element processing the loop portion and generating a code based on which data in the overlap area is updated through calculations in multiple iterations using only the data in the processing element and a code based on which the data in the overlap area is updated through single direction data transfer, and for outputting a converted program.